Abstract

We study the K-armed dueling bandit problem, a variation of the standard stochastic bandit problem
where the feedback is limited to relative comparisons of a pair of arms. We introduce a tight
asymptotic regret lower bound that is based on the information divergence. An algorithm that is 
inspired by the Deterministic Minimum Empirical Divergence algorithm (Honda and Takemura, 
2010) is proposed, and its regret is analyzed. The proposed algorithm is found to be the first one 
with a regret upper bound that matches the lower bound. Experimental comparisons of dueling 
bandit algorithms show that the proposed algorithm significantly outperforms existing ones. 
Keywords: multi-armed bandit problem, dueling bandit problem, online learning 


1. Introduction 

A multi-armed bandit problem is a crystallized instance of a sequential decision-making problem 
in an uncertain environment, and it can model many real-world scenarios. This problem involves 
conceptual entities called arms, and a forecaster who tries to identify good arms from bad ones. At 
each round, the forecaster draws one of the K arms and receives a corresponding reward. The aim 
of the forecaster is to maximize the cumulative reward over rounds, which is achieved by running an 
algorithm that balances the exploration (acquisition of information) and the exploitation (utilization 
of information). 

While it is desirable to obtain direct feedback from an arm, in some cases such direct feedback is 
not available. In this paper, we consider a version of the standard stochastic bandit problem called 
the K-armed dueling bandit problem (Yue et al., 2009), in which the forecaster receives relative
feedback, which specifies which of two arms is preferred. Although the original motivation of the
dueling bandit problem arose in the field of information retrieval, learning under relative feedback
is universal to many fields, such as recommender systems (Gemmis et al., 2009), graphical design
(Brochu et al., 2010), and natural language processing (Zaidan and Callison-Burch, 2011), which
involve explicit or implicit feedback provided by humans.

Related work: Here, we briefly discuss the literature of the K-armed dueling bandit problem. The
problem involves a preference matrix $M = \{\mu_{i,j}\} \in \mathbb{R}^{K \times K}$ whose $ij$ entry $\mu_{i,j}$ corresponds to the
probability that arm $i$ is preferred to arm $j$.

Most algorithms assume that the preference matrix has certain properties. Interleaved Filter (IF)
(Yue et al., 2012) and Beat the Mean Bandit (BTM) (Yue and Joachims, 2011), early algorithms
proposed for solving the dueling bandit problem, require the arms to be totally ordered, that is,
$i \succ j \Leftrightarrow \mu_{i,j} > 1/2$. Moreover, IF assumes stochastic transitivity: for any triple $(i,j,k)$ with
$i \succ j \succ k$, $\mu_{i,k} \ge \max\{\mu_{i,j}, \mu_{j,k}\}$. Unfortunately, stochastic transitivity does not hold in many
real-world settings (Yue and Joachims, 2011). BTM relaxes this assumption by introducing relaxed
stochastic transitivity: there exists $\gamma \ge 1$ such that for all pairs $(j,k)$ with $1 \succ j \succ k$,
$\gamma \mu_{1,k} \ge \max\{\mu_{1,j}, \mu_{j,k}\}$ holds. The drawback of BTM is that it requires the explicit value of $\gamma$ on which
the performance of the algorithm depends. Urvoy et al. (2013) considered a wide class of sequential
learning problems with bandit feedback that includes the dueling bandit problem. They proposed the
Sensitivity Analysis of VAriables for Generic Exploration (SAVAGE) algorithm, which empirically
outperforms IF and BTM for moderate $K$. Among the several versions of SAVAGE, the one called
Condorcet SAVAGE makes the Condorcet assumption and performed the best in their experiment.
The Condorcet assumption is that there is a unique arm that is superior to the others. Unlike the two
transitivity assumptions, the Condorcet assumption does not require the arms to be totally ordered
and is less restrictive. IF, BTM, and SAVAGE either explicitly require the number of rounds $T$, or
implicitly require $T$ to determine the confidence level $\delta$.

Recently, an algorithm called Relative Upper Confidence Bound (RUCB) (Zoghi et al., 2014b)
was proven to have an $O(K \log T)$ regret bound under the Condorcet assumption. RUCB is based on
the upper confidence bound index (Lai and Robbins, 1985; Agrawal, 1995; Auer et al., 2002) that is
widely used in the field of bandit problems. RUCB is horizonless: it does not require $T$ beforehand
and runs for any duration. Zoghi et al. (2015) extended RUCB into the mergeRUCB algorithm
under the Condorcet assumption as well as the assumption that a portion of the preference matrix is
informative (i.e., different from 1/2). They reported that mergeRUCB outperformed RUCB when $K$
was large. Ailon et al. (2014) proposed three algorithms named Doubler, MultiSBM, and Sparring.
MultiSBM is endowed with an $O(K \log T)$ regret bound and Sparring was reported to outperform
IF and BTM in their simulation. These algorithms assume that the pairwise feedback is generated
from the non-observable utilities of the selected arms. The existence of the utility distributions
associated with individual arms restricts the structure of the preference matrix.

In summary, most algorithms either have $O(K^2 \log T)$ regret under the Condorcet assumption
(SAVAGE) or require additional assumptions to achieve $O(K \log T)$ regret (IF, BTM, MultiSBM,
and mergeRUCB). To the best of our knowledge, RUCB is the only algorithm with an $O(K \log T)$
regret bound^1. The main difficulty of the dueling bandit problem lies in the fact that there are $K - 1$
candidates of actions to test “how good” each arm $i$ is. A naive use of the confidence bound requires
every pair of arms to be compared $O(\log T)$ times and yields an $O(K^2 \log T)$ regret bound.

Contribution: In this paper, we propose an algorithm called Relative Minimum Empirical Divergence
(RMED). This paper contributes to our understanding of the dueling bandit problem in the
following three respects.

• The regret lower bound: Some studies (e.g., Yue et al., 2012) have shown that the K-armed
dueling bandit problem has an $\Omega(K \log T)$ regret lower bound. In this paper, we further analyze
this lower bound to obtain the optimal constant factor for models satisfying the Condorcet
assumption. Furthermore, we show that the lower bound is the same under the total order
assumption. This means that optimal algorithms under the Condorcet assumption also achieve
a lower bound of regret under the total order assumption even though such algorithms do not
know that the arms are totally ordered.

• An optimal algorithm: The regret of RMED is not only $O(K \log T)$, but also optimal in
the sense that its constant factor matches the asymptotic lower bound under the Condorcet
assumption. RMED is the first optimal algorithm in the study of the dueling bandit problem.

• Empirical performance assessment: The performance of RMED is extensively evaluated by
using five datasets: two synthetic datasets, one including preference data, and two including
ranker evaluations in the information retrieval domain.

1. Zoghi et al. (2013) first proposed RUCB with an $O(K^2 \log T)$ regret bound and later modified it by adding a
randomization procedure to assure $O(K \log T)$ regret in Zoghi et al. (2014b).


2. Problem Setup 


The K-armed dueling bandit problem involves $K$ arms that are indexed as $[K] = \{1, 2, \ldots, K\}$.
Let $M \in \mathbb{R}^{K \times K}$ be a preference matrix whose $ij$ entry $\mu_{i,j}$ corresponds to the probability that
arm $i$ is preferred to arm $j$. At each round $t = 1, 2, \ldots, T$, the forecaster selects a pair of arms
$(l(t), m(t)) \in [K]^2$, then receives a relative feedback $\hat{X}_{l(t),m(t)}(t) \sim \mathrm{Bernoulli}(\mu_{l(t),m(t)})$ that
indicates which of $(l(t), m(t))$ is preferred. By definition, $\mu_{i,j} = 1 - \mu_{j,i}$ holds for any $i, j \in [K]$
and $\mu_{i,i} = 1/2$.

Let $N_{i,j}(t)$ be the number of comparisons of pair $(i,j)$ and $\hat{\mu}_{i,j}(t)$ be the empirical estimate of
$\mu_{i,j}$ at round $t$. In building statistics by using the feedback, we treat pairs without taking their order
into consideration. Therefore, for $i \ne j$,

$$N_{i,j}(t) = \sum_{t'=1}^{t-1} \left( 1\{l(t') = i, m(t') = j\} + 1\{l(t') = j, m(t') = i\} \right)$$

and

$$\hat{\mu}_{i,j}(t) = \frac{1}{N_{i,j}(t)} \sum_{t'=1}^{t-1} \left( 1\{l(t') = i, m(t') = j, \hat{X}_{l(t'),m(t')}(t') = 1\} + 1\{l(t') = j, m(t') = i, \hat{X}_{l(t'),m(t')}(t') = 0\} \right),$$

where $1[\cdot]$ is the indicator function. For $j \ne i$, let
$N_{i>j}(t)$ be the number of times $i$ is preferred over $j$. Then, $\hat{\mu}_{i,j}(t) = N_{i>j}(t)/N_{i,j}(t)$, where we
set $0/0 = 1/2$ here. Let $\hat{\mu}_{i,i}(t) = 1/2$.

Throughout this paper, we will assume that the preference matrix has a Condorcet winner
(Urvoy et al., 2013). Here we call an arm $i$ the Condorcet winner if $\mu_{i,j} > 1/2$ for any $j \in [K] \setminus \{i\}$.
Without loss of generality, we will assume that arm 1 is the Condorcet winner. The set of preference
matrices which have a Condorcet winner is denoted by $\mathcal{M}_C$. We also define the set of preference
matrices satisfying the total order by $\mathcal{M}_o \subset \mathcal{M}_C$; that is, the relation $i \prec j \Leftrightarrow \mu_{i,j} < 1/2$ induces
a total order iff $\{\mu_{i,j}\} \in \mathcal{M}_o$.

Let $\Delta_{i,j} = \mu_{i,j} - 1/2$. We define the regret per round as $r(t) = (\Delta_{1,i} + \Delta_{1,j})/2$ when the
pair $(i,j)$ is compared. The expectation of the cumulative regret, $E[R(T)] = E[\sum_{t=1}^{T} r(t)]$, is used
to measure the performance of an algorithm. The regret increases at each round unless the selected
pair is $(l(t), m(t)) = (1, 1)$.
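
The setup above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `PairStats`, `duel`, and `round_regret` are hypothetical, arms are indexed from 0 (so the Condorcet winner, arm 1 in the text, is index 0), and the convention 0/0 = 1/2 follows the definition of the empirical estimate.

```python
import random


class PairStats:
    """Order-free pairwise statistics: N_{i,j} and the empirical mean mu_hat_{i,j}."""

    def __init__(self, K):
        # wins[i][j] counts how many times arm i was preferred to arm j.
        self.wins = [[0] * K for _ in range(K)]

    def update(self, i, j, i_preferred):
        if i == j:
            return  # a self-comparison carries no information
        if i_preferred:
            self.wins[i][j] += 1
        else:
            self.wins[j][i] += 1

    def count(self, i, j):
        # N_{i,j}: comparisons of the unordered pair {i, j}.
        return self.wins[i][j] + self.wins[j][i]

    def mu_hat(self, i, j):
        n = self.count(i, j)
        # Convention from the text: 0/0 = 1/2 and mu_hat_{i,i} = 1/2.
        return 0.5 if i == j or n == 0 else self.wins[i][j] / n


def duel(M, i, j, rng):
    """Sample Bernoulli(mu_{i,j}); True means arm i is preferred to arm j."""
    return rng.random() < M[i][j]


def round_regret(M, i, j, winner=0):
    """r(t) = (Delta_{1,i} + Delta_{1,j}) / 2, with the Condorcet winner at `winner`."""
    return ((M[winner][i] - 0.5) + (M[winner][j] - 0.5)) / 2.0
```

Comparing the winner with itself, `round_regret(M, 0, 0)`, gives zero regret, which is why the regret stops growing only when the pair (1, 1) is selected.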


2.1. Regret lower bound in the K-armed dueling bandits

In this section we provide an asymptotic regret lower bound when $T \to \infty$. Let the superiors of
arm $i$ be a set $\mathcal{O}_i = \{j \mid j \in [K], \mu_{i,j} < 1/2\}$, that is, the set of arms that are preferred to $i$ on average.
The essence of the K-armed dueling bandit problem is how to eliminate each arm $i \in [K] \setminus \{1\}$ by
making sure that arm $i$ is not the Condorcet winner. To do so, the algorithm uses some of the arms
in $\mathcal{O}_i$ and compares $i$ with them.

A dueling bandit algorithm is strongly consistent for model $\mathcal{M} \subset \mathcal{M}_C$ iff it has $E[R(T)] = o(T^a)$
regret for any $a > 0$ and any $M \in \mathcal{M}$. The following lemma is on the number of comparisons
of suboptimal arm pairs.

Lemma 1 (The lower bound on the number of suboptimal arm draws) (i) Let an arm $i \in [K] \setminus \{1\}$
and preference matrix $M \in \mathcal{M}_C$ be arbitrary. Given any strongly consistent algorithm for model
$\mathcal{M}_C$, we have

$$E\left[ \sum_{j \in \mathcal{O}_i} d(\mu_{i,j}, 1/2) \, N_{i,j}(T) \right] \ge (1 - o(1)) \log T, \quad (1)$$

where $d(p, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1-p}{1-q}$ is the KL divergence between two Bernoulli distributions
with parameters $p$ and $q$. (ii) Furthermore, inequality (1) holds for any $M \in \mathcal{M}_o$ given any strongly
consistent algorithm for $\mathcal{M}_o$.

Lemma 1 states that, for arbitrary arm $j \in \mathcal{O}_i$, an algorithm needs to make $\log T / d(\mu_{i,j}, 1/2)$
comparisons between arms $i$ and $j$ to be convinced that arm $i$ is inferior to arm $j$ and thus $i$ is not the
Condorcet winner. Since the regret increase per round of comparing arm $i$ with $j$ is $(\Delta_{1,i} + \Delta_{1,j})/2$,
eliminating arm $i$ by comparing it with $j$ incurs a regret of

$$\frac{(\Delta_{1,i} + \Delta_{1,j}) \log T}{2 d(\mu_{i,j}, 1/2)}. \quad (2)$$

Therefore, the total regret is bounded from below by comparing each arm $i$ with an arm $j$ that
minimizes (2) and the regret lower bound is formalized in the following theorem.


Theorem 2 (The regret lower bound) (i) Let the preference matrix $M \in \mathcal{M}_C$ be arbitrary. For any
strongly consistent algorithm for model $\mathcal{M}_C$,

$$\liminf_{T \to \infty} \frac{E[R(T)]}{\log T} \ge \sum_{i \in [K] \setminus \{1\}} \min_{j \in \mathcal{O}_i} \frac{\Delta_{1,i} + \Delta_{1,j}}{2 d(\mu_{i,j}, 1/2)} \quad (3)$$

holds. (ii) Furthermore, inequality (3) holds for any $M \in \mathcal{M}_o$ given any strongly consistent
algorithm for $\mathcal{M}_o$.
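
As a rough illustration of Theorem 2, the constant on the right-hand side of (3) can be evaluated numerically for a given preference matrix. The sketch below is not taken from the paper's code: the function names are hypothetical, natural logarithms are assumed, arms are 0-indexed with the Condorcet winner at index 0, and the 3-arm matrix used in the usage note is an arbitrary example.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), natural log."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def lower_bound_constant(M, winner=0):
    """Sum over suboptimal arms i of min_{j in O_i} (Delta_{1,i} + Delta_{1,j}) / (2 d(mu_{i,j}, 1/2))."""
    K = len(M)
    total = 0.0
    for i in range(K):
        if i == winner:
            continue
        delta_i = M[winner][i] - 0.5
        # Superiors O_i: arms that beat i on average. The winner is always among them.
        superiors = [j for j in range(K) if M[i][j] < 0.5]
        total += min(
            (delta_i + (M[winner][j] - 0.5)) / (2.0 * kl_bernoulli(M[i][j], 0.5))
            for j in superiors
        )
    return total
```

For example, with `M = [[0.5, 0.7, 0.7], [0.3, 0.5, 0.6], [0.3, 0.4, 0.5]]`, both suboptimal arms are eliminated most cheaply against the winner, so `lower_bound_constant(M)` equals `2 * 0.2 / (2 * kl_bernoulli(0.3, 0.5))`; multiplying the result by log T gives the asymptotic lower bound on the regret.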


The proof of Lemma 1 and Theorem 2 can be found in Appendix B. The proof of Lemma 1 is 
similar to that of Lai and Robbins (1985, Theorem 1) for the standard multi-armed bandit problem 
but differs in the following point that is characteristic to the dueling bandit. To achieve a small regret 
in the dueling bandit, it is necessary to compare the arm i with itself if i is the Condorcet winner. 
However, we trivially know that $\mu_{i,i} = 1/2$ without sampling and such a comparison yields no
information to distinguish possible preference matrices. We can avoid this difficulty by evaluating
$N_{i,j}$ and $N_{i,i}$ in different ways.


3. RMED1 Algorithm

In this section, we first introduce the notion of empirical divergence. Then, on the basis of the 
empirical divergence, we formulate the RMED1 algorithm. 
Algorithm 1 Relative Minimum Empirical Divergence (RMED) Algorithm

 1: Input: $K$ arms, $f(K) \ge 0$. $\alpha > 0$ (RMED2FH, RMED2). $T$ (RMED2FH).
 2: $L \leftarrow 1$ (RMED1, RMED2), or $L \leftarrow \lceil \alpha \log\log T \rceil$ (RMED2FH).
 3: Initial phase: draw each pair of arms $L$ times. At the end of this phase, $t = L(K-1)K/2$.
 4: if RMED2FH then
 5:   For each arm $i \in [K]$, fix $\hat{b}^*(i)$ by (6).
 6: end if
 7: $L_C, L_R \leftarrow [K]$, $L_N \leftarrow \emptyset$.
 8: while $t \le T$ do
 9:   if RMED2 then
10:     Draw all pairs $(i,j)$ until it reaches $N_{i,j}(t) \ge \alpha \log\log t$. $t \leftarrow t + 1$ for each draw.
11:   end if
12:   for $l(t) \in L_C$ in an arbitrarily fixed order do
13:     Select $m(t)$ by using Algorithm 2 (RMED1) or Algorithm 3 (RMED2, RMED2FH).
14:     Draw arm pair $(l(t), m(t))$.
15:     $L_R \leftarrow L_R \setminus \{l(t)\}$.
16:     $L_N \leftarrow L_N \cup \{j\}$ (without a duplicate) for any $j \notin L_R$ such that $J_j(t)$ holds.
17:     $t \leftarrow t + 1$.
18:   end for
19:   $L_C, L_R \leftarrow L_N$, $L_N \leftarrow \emptyset$.
20: end while

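Algorithm 2 RMED1 subroutine for selecting $m(t)$

1: $\hat{\mathcal{O}}_{l(t)}(t) \leftarrow \{j \in [K] \setminus \{l(t)\} \mid \hat{\mu}_{l(t),j}(t) \le 1/2\}$
2: if $i^*(t) \in \hat{\mathcal{O}}_{l(t)}(t)$ or $\hat{\mathcal{O}}_{l(t)}(t) = \emptyset$ then
3:   $m(t) \leftarrow i^*(t)$.
4: else
5:   $m(t) \leftarrow \arg\min_{j \ne l(t)} \hat{\mu}_{l(t),j}(t)$.
6: end if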



3.1. Empirical divergence and likelihood function 

In inequality (1) of Section 2.1, we have seen that $\sum_{j \in \mathcal{O}_i} d(\mu_{i,j}, 1/2) N_{i,j}(T)$, the sum of the
divergence between $\mu_{i,j}$ and 1/2 multiplied by the number of comparisons between $i$ and $j$, is the
characteristic value that defines the minimum number of comparisons. The empirical estimate of
this value is fundamentally useful for evaluating how unlikely arm $i$ is to be the Condorcet winner.
Let the opponents of arm $i$ at round $t$ be the set $\hat{\mathcal{O}}_i(t) = \{j \mid j \in [K] \setminus \{i\}, \hat{\mu}_{i,j}(t) \le 1/2\}$. Note that,
unlike the superiors $\mathcal{O}_i$, the opponents $\hat{\mathcal{O}}_i(t)$ for each arm $i$ are defined in terms of the empirical
averages, and thus the algorithms know who the opponents are. Let the empirical divergence be

$$I_i(t) = \sum_{j \in \hat{\mathcal{O}}_i(t)} d(\hat{\mu}_{i,j}(t), 1/2) \, N_{i,j}(t).$$
The value $\exp(-I_i(t))$ can be considered as the “likelihood” that arm $i$ is the Condorcet winner. Let
$i^*(t) = \arg\min_{i \in [K]} I_i(t)$ (ties are broken arbitrarily) and $I^*(t) = I_{i^*(t)}(t)$. By definition, $I^*(t) \ge 0$.
RMED is inspired by the Deterministic Minimum Empirical Divergence (DMED) algorithm
(Honda and Takemura, 2010). DMED, which is designed for solving the standard K-armed bandit
problem, draws arms that may be the best one with probability $\Omega(1/t)$, whereas RMED in the
dueling bandit problem draws arms that are likely to be the Condorcet winner with probability
$\Omega(1/t)$. Namely, any arm $i$ that satisfies

$$J_i(t) = \{ I_i(t) - I^*(t) \le \log t + f(K) \} \quad (4)$$

is the candidate of the Condorcet winner and will be drawn soon. Here, $f(K)$ can be any
non-negative function of $K$ that is independent of $t$. Algorithm 1 lists the main routine of RMED.
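
The empirical divergence and the candidate condition (4) can be sketched as follows. This is a hedged illustration under stated assumptions rather than the reference implementation: the helper names are hypothetical, `wins[i][j]` is assumed to hold the number of times arm i was preferred to arm j, arms are 0-indexed, and natural logarithms are used.

```python
import math


def kl_half(p):
    """d(p, 1/2) for a Bernoulli parameter p, natural log."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(2.0 * p)
    if p < 1.0:
        total += (1.0 - p) * math.log(2.0 * (1.0 - p))
    return total


def empirical_divergence(wins, i):
    """I_i(t): sum over opponents j (mu_hat_{i,j} <= 1/2) of d(mu_hat_{i,j}, 1/2) * N_{i,j}."""
    total = 0.0
    for j in range(len(wins)):
        if j == i:
            continue
        n = wins[i][j] + wins[j][i]
        mu_hat = wins[i][j] / n if n > 0 else 0.5  # convention 0/0 = 1/2
        if mu_hat <= 0.5:
            total += kl_half(mu_hat) * n
    return total


def candidates(wins, t, f_K):
    """Arms i for which J_i(t) holds: I_i(t) - I*(t) <= log t + f(K). Requires t >= 1."""
    divergences = [empirical_divergence(wins, i) for i in range(len(wins))]
    best = min(divergences)
    return [i for i, value in enumerate(divergences)
            if value - best <= math.log(t) + f_K]
```

An arm that has never lost a pairwise tally has divergence 0 and is therefore always a candidate; an arm drops out only once its accumulated evidence of losing exceeds log t + f(K).
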
There are several versions of RMED. First, we introduce RMED1. RMED1 initially compares all
pairs once (initial phase). Let $T_{\mathrm{init}} = (K-1)K/2$ be the last round of the initial phase. From
$t = T_{\mathrm{init}} + 1$, it selects the arm by using a loop. $L_C = L_C(t)$ is the set of arms in the current loop,
and $L_R = L_R(t) \subset L_C(t)$ is the remaining arms of $L_C$ that have not been drawn yet in the current
loop. $L_N = L_N(t)$ is the set of arms that are going to be drawn in the next loop. An arm $i$ is put
into $L_N$ when it satisfies $J_i(t) \cap \{i \notin L_R(t)\}$. By definition, at least one arm (i.e., $i^*(t)$ at the end
of the current loop) is put into $L_N$ in each loop. For arm $l(t)$ in the current loop, RMED1 selects
$m(t)$ (i.e., the comparison target of $l(t)$) determined by Algorithm 2.
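
Putting the main routine and the RMED1 subroutine together, a compact simulation might look like the sketch below. It is an illustrative reading of Algorithms 1 and 2, not the authors' code: the function name `rmed1` is hypothetical, arm index 0 is assumed to be the Condorcet winner (needed only to compute the regret), the candidate condition is checked with the statistics available right after each draw, and natural logarithms are used.

```python
import math
import random


def kl_half(p):
    """d(p, 1/2) for a Bernoulli parameter p, natural log."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(2.0 * p)
    if p < 1.0:
        total += (1.0 - p) * math.log(2.0 * (1.0 - p))
    return total


def rmed1(M, T, f_K, seed=0):
    """Run an RMED1-style loop for T rounds on preference matrix M; return cumulative regret."""
    rng = random.Random(seed)
    K = len(M)
    wins = [[0] * K for _ in range(K)]  # wins[i][j]: times i was preferred to j
    regret = 0.0
    t = 0

    def mu_hat(i, j):
        n = wins[i][j] + wins[j][i]
        return 0.5 if i == j or n == 0 else wins[i][j] / n

    def divergence(i):
        return sum(kl_half(mu_hat(i, j)) * (wins[i][j] + wins[j][i])
                   for j in range(K) if j != i and mu_hat(i, j) <= 0.5)

    def draw(i, j):
        nonlocal regret, t
        if i != j:  # a self-comparison yields no information
            if rng.random() < M[i][j]:
                wins[i][j] += 1
            else:
                wins[j][i] += 1
        regret += ((M[0][i] - 0.5) + (M[0][j] - 0.5)) / 2.0
        t += 1

    # Initial phase: compare every pair once.
    for i in range(K):
        for j in range(i + 1, K):
            if t < T:
                draw(i, j)

    current = list(range(K))  # L_C
    while t < T:
        remaining = set(current)  # L_R
        nxt = []  # L_N
        for l in current:
            if t >= T:
                break
            I = [divergence(i) for i in range(K)]
            i_star = min(range(K), key=lambda i: I[i])
            opponents = [j for j in range(K) if j != l and mu_hat(l, j) <= 0.5]
            if i_star in opponents or not opponents:
                m = i_star  # Algorithm 2, line 3
            else:
                m = min((j for j in range(K) if j != l), key=lambda j: mu_hat(l, j))
            draw(l, m)
            remaining.discard(l)
            I = [divergence(i) for i in range(K)]
            I_star = min(I)
            for j in range(K):
                if j not in remaining and j not in nxt and I[j] - I_star <= math.log(t) + f_K:
                    nxt.append(j)
        current = nxt
    return regret
```

A call such as `rmed1([[0.5, 0.7, 0.7], [0.3, 0.5, 0.6], [0.3, 0.4, 0.5]], T=3000, f_K=0.9)` returns the cumulative regret of one simulated run. The arm with the smallest empirical divergence always satisfies the candidate condition, so the next loop is never empty.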

The following theorem, which is proven in Section 5, describes a regret bound of RMED1.


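Theorem 3 For any sufficiently small $\delta > 0$, the regret of RMED1 is bounded as:

$$E[R(T)] \le \sum_{i \in [K] \setminus \{1\}} \frac{((1 + \delta) \log T + f(K)) \, \Delta_{1,i}}{2 d(\mu_{i,1}, 1/2)} + O(K^2) + O\left(\frac{K}{\delta^2}\right) + O\left(e^{AK - f(K)}\right),$$

where $A = A(\{\mu_{i,j}\})$ is a constant as a function of $T$. Therefore, by letting $\delta = \log^{-1/3} T$
and choosing an $f(K) = cK^{1+\epsilon}$ for arbitrary $c, \epsilon > 0$, we obtain

$$E[R(T)] \le \sum_{i \in [K] \setminus \{1\}} \frac{\Delta_{1,i} \log T}{2 d(\mu_{i,1}, 1/2)} + O(K^{2+\epsilon}) + O(K \log^{2/3} T).$$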


3.2. Gap between the constant factor of RMED1 and the lower bound

From the lower bound of Theorem 2, the $O(K \log T)$ regret bound of RMED1 is optimal up to
a constant factor. Moreover, the constant factor matches the regret lower bound of Theorem 2 if
$b^*(i) = 1$ for all $i \in [K] \setminus \{1\}$ where

$$b^*(i) = \arg\min_{j \in \mathcal{O}_i} \frac{\Delta_{1,i} + \Delta_{1,j}}{d^+(\mu_{i,j}, 1/2)}. \quad (5)$$

Here we define $d^+(p, q) = d(p, q)$ if $p < q$ and 0 otherwise, and $x/0 = +\infty$. Note that there
can be ties that minimize the RHS of (5). In that case, we may choose any of the ties as $b^*(i)$ to
eliminate arm $i$. For ease of explanation, we henceforth will assume that $b^*(i)$ is unique, but our
results can be easily extended to the case of ties.

We claim that $b^*(i) = 1$ holds in many cases for the following mathematical and practical
reasons. (i) The regret of drawing a pair $(i,j)$, $j \ne 1$, is $(\Delta_{1,i} + \Delta_{1,j})/2$, whereas it is simply
$\Delta_{1,i}/2$ for the pair $(i, 1)$. Thus, $d^+(\mu_{i,j}, 1/2)$ has to be much larger than $d^+(\mu_{i,1}, 1/2)$ in order to
satisfy $b^*(i) = j$. (ii) The Condorcet winner usually wins over the other arms by a large margin,
and therefore, $d^+(\mu_{i,1}, 1/2) \ge d^+(\mu_{i,j}, 1/2)$. For example, in the preference matrix of Example 1
(Table 1(a)), $b^*(3) = 1$ as long as $q < 0.79$. Example 2 (Table 1(b)) is a preference matrix based
on six retrieval functions in the full-text search engine of ArXiv.org (Yue and Joachims, 2011)^2. In
Example 2, $b^*(i) = 1$ holds for all $i$, even though $\mu_{1,4} < \mu_{2,4}$. In the case of a 16-ranker evaluation
based on the Microsoft Learning to Rank dataset (details are given in Section 4), occasionally
$b^*(i) \ne 1$ occurs, but the difference between the regrets of drawing arm 1 and $b^*(i)$ is fairly small
(smaller than 1.2% on average). Nevertheless, there are some cases in which comparing arm $i$ with
1 is not such a clever idea. Example 3 (Table 1(c)) is a toy example in which comparing arm $i$ with
$b^*(i) \ne 1$ makes a large difference. In Example 3, it is clearly better to draw pairs (2, 4), (3, 2) and
(4, 3) to eliminate arms 2, 3, and 4, respectively. Accordingly, it is still interesting to consider an
algorithm that reduces regret by comparing arm $i$ with $b^*(i)$.

Algorithm 3 Subroutine for selecting $m(t)$ in RMED2 and RMED2FH

1: if RMED2 then
2:   Update $\hat{b}^*(l(t))$ by (6).
3: end if
4: $\hat{\mathcal{O}}_{l(t)}(t) \leftarrow \{j \in [K] \setminus \{l(t)\} \mid \hat{\mu}_{l(t),j}(t) \le 1/2\}$.
5: if $\hat{b}^*(l(t)) \in \hat{\mathcal{O}}_{l(t)}(t)$ and $N_{l(t),i^*(t)}(t) \ge N_{l(t),\hat{b}^*(l(t))}(t) / \log\log t$ (RMED2), with $\log\log T$ in place of $\log\log t$ for RMED2FH, then
6:   $m(t) \leftarrow \hat{b}^*(l(t))$.
7: else
8:   Select $m(t)$ by using Algorithm 2.
9: end if
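
The claim about Example 3 can be checked numerically by evaluating (5) directly. The sketch below is illustrative only: the function names are hypothetical, arms are 0-indexed (so arm 1 in the text is index 0), natural logarithms are assumed, and the minimization runs over all $j \ne i$ with the convention $x/0 = +\infty$, which selects the same arm as minimizing over the superiors.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), natural log."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def d_plus(p, q):
    """d^+(p, q) = d(p, q) if p < q, and 0 otherwise."""
    return kl_bernoulli(p, q) if p < q else 0.0


def best_opponent(M, i, winner=0):
    """b*(i): the arm j minimizing (Delta_{1,i} + Delta_{1,j}) / d^+(mu_{i,j}, 1/2)."""
    best_j, best_cost = None, math.inf
    for j in range(len(M)):
        if j == i:
            continue
        divergence = d_plus(M[i][j], 0.5)
        if divergence == 0.0:
            continue  # x / 0 = +infinity: j is not a superior of i
        cost = ((M[winner][i] - 0.5) + (M[winner][j] - 0.5)) / divergence
        if cost < best_cost:
            best_j, best_cost = j, cost
    return best_j


# Example 3 (Table 1(c)), rows and columns 0-indexed.
CYCLIC = [
    [0.5, 0.6, 0.6, 0.6],
    [0.4, 0.5, 0.9, 0.1],
    [0.4, 0.1, 0.5, 0.9],
    [0.4, 0.9, 0.1, 0.5],
]
```

With this matrix, `best_opponent(CYCLIC, 1)` is index 3, `best_opponent(CYCLIC, 2)` is index 1, and `best_opponent(CYCLIC, 3)` is index 2, which corresponds to the pairs (2, 4), (3, 2), and (4, 3) in the 1-indexed notation of the text.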


Table 1: Three preference matrices. In each example, the value at row i, column j is $\mu_{i,j}$.

(a) Example 1

       1      2      3
1     0.5    0.7    0.7
2     0.3    0.5    q
3     0.3    1-q    0.5

(b) Example 2

       1      2      3      4      5      6
1     0.50   0.55   0.55   0.54   0.61   0.61
2     0.45   0.50   0.55   0.55   0.58   0.60
3     0.45   0.45   0.50   0.54   0.51   0.56
4     0.46   0.45   0.46   0.50   0.54   0.50
5     0.39   0.42   0.49   0.46   0.50   0.51
6     0.39   0.40   0.44   0.50   0.49   0.50

(c) Example 3

       1      2      3      4
1     0.5    0.6    0.6    0.6
2     0.4    0.5    0.9    0.1
3     0.4    0.1    0.5    0.9
4     0.4    0.9    0.1    0.5


3.3. RMED2 Algorithm 

We here propose RMED2, which gracefully estimates $b^*(i)$ during a bandit game and compares
arm $i$ with $b^*(i)$. RMED2 and RMED1 share the main routine (Algorithm 1). The subroutine of
RMED2 for selecting $m(t)$ is shown in Algorithm 3. Unlike RMED1, RMED2 keeps drawing pairs
of arms $(i,j)$ at least $\alpha \log\log t$ times (Line 10 in Algorithm 1). The regret of this exploration is
insignificant since $O(\log\log T) = o(\log T)$. Once all pairs have been explored more than $\alpha \log\log t$
times, RMED2 goes to the main loop. RMED2 determines $m(t)$ by using Algorithm 3 based on the
estimate of $b^*(i)$ given by

$$\hat{b}^*(i) = \arg\min_{j \in [K] \setminus \{i\}} \frac{\hat{\Delta}_{i^*(t),i} + \hat{\Delta}_{i^*(t),j}}{d^+(\hat{\mu}_{i,j}(t), 1/2)}, \quad (6)$$

where ties are broken arbitrarily, $\hat{\Delta}_{i,j} = \hat{\mu}_{i,j}(t) - 1/2$, and we set $x/0 = +\infty$. Intuitively, RMED2
tries to select $m(t) = \hat{b}^*(i)$ for most rounds, and occasionally explores $i^*(t)$ in order to reduce the
regret increase when RMED2 fails to estimate the true $b^*(i)$ correctly.

2. In the original preference matrix of Yue and Joachims (2011), $\mu_{2,4} \ne 1 - \mu_{4,2}$. To satisfy $\mu_{2,4} = 1 - \mu_{4,2}$, we
replaced $\mu_{2,4}$ and $\mu_{4,2}$ of the original with $(\mu_{2,4} - \mu_{4,2} + 1)/2$ and $(\mu_{4,2} - \mu_{2,4} + 1)/2$, respectively.


3.4. RMED2FH algorithm 

Although we believe that the regret of RMED2 is optimal, the analysis of RMED2 is somewhat
complicated since it sometimes breaks the main loop and explores from time to time. For ease of
analysis, we here propose RMED2 Fixed Horizon (RMED2FH, Algorithms 1 and 3), which is a
“static” version of RMED2. Essentially, RMED2 and RMED2FH have the same mechanism. The
differences are that (i) RMED2FH conducts an $\alpha \log\log T$ exploration in the initial phase, and
(ii) after the initial phase, $\hat{b}^*(i)$ for each $i$ is fixed throughout the game. Note that, unlike RMED1 and
RMED2, RMED2FH requires the number of rounds $T$ beforehand to conduct the initial $\alpha \log\log T$
draws of each pair. The following theorem shows the regret of RMED2FH that matches the lower
bound of Theorem 2.


Theorem 4 For any sufficiently small $\delta > 0$, the regret of RMED2FH is bounded as:

$$E[R(T)] \le \sum_{i \in [K] \setminus \{1\}} \frac{(\Delta_{1,i} + \Delta_{1,b^*(i)})((1 + \delta) \log T)}{2 d(\mu_{i,b^*(i)}, 1/2)} + O(\alpha K^2 \log\log T) + O\left(e^{AK - f(K)}\right) + O\left(\frac{K \log T}{\log\log T}\right) + O\left(\frac{K}{\delta^2}\right) + O(K f(K)), \quad (7)$$

where $A = A(\{\mu_{i,j}\}) > 0$ is a constant as a function of $T$. By setting $\delta = O((\log T)^{-1/3})$ and
choosing an $f(K) = cK^{1+\epsilon}$ ($c, \epsilon > 0$) we obtain

$$E[R(T)] \le \sum_{i \in [K] \setminus \{1\}} \frac{(\Delta_{1,i} + \Delta_{1,b^*(i)}) \log T}{2 d(\mu_{i,b^*(i)}, 1/2)} + O(\alpha K^2 \log\log T) + O\left(\frac{K \log T}{\log\log T}\right) + O(K^{2+\epsilon}). \quad (8)$$

Note that all terms except the first one in (8) are $o(\log T)$. From Theorems 2 and 4 we see that (i)
RMED2FH is asymptotically optimal under the Condorcet assumption and (ii) the logarithmic term
on the regret bound of RMED2FH cannot be improved even if the arms are totally ordered and the
forecaster knows of the existence of the total order. The proof sketch of Theorem 4 is in Section 5.

4. Experimental Evaluation 

To evaluate the empirical performance of RMED, we conducted simulations^3 with five bandit
datasets (preference matrices). The datasets are as follows:

3. The source code of the simulations is available at https://github.com/jkomiyama/duelingbanditlib.










Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem 





Figure 1: Regret-round log-log plots of algorithms. Panels: (a) Six rankers, (b) Cyclic, (c) Arithmetic,
(d) Sushi, (e) MSLR K = 16, (f) MSLR K = 64.


Six rankers is the preference matrix based on the six retrieval functions in the full-text search engine
of ArXiv.org (Table 1(b)).

Cyclic is the artificial preference matrix shown in Table 1(c). This matrix is designed so that the
comparison of $i$ with 1 is not optimal.

Arithmetic dataset involves eight arms with $\mu_{i,j} = 0.5 + 0.05(j - i)$ and has a total order.

Sushi dataset is based on the Sushi preference dataset (Kamishima, 2003) that contains the
preferences of 5,000 Japanese users as regards 100 types of sushi. We extracted the 16 most popular
types of sushi and converted them into arms with $\mu_{i,j}$ corresponding to the ratio of users who prefer
sushi $i$ over $j$. The Condorcet winner is the mildly-fatty tuna (chu-toro).

MSLR: We tested submatrices of a 136 x 136 preference matrix from Zoghi et al. (2015), which is
derived from the Microsoft Learning to Rank (MSLR) dataset (Microsoft Research, 2010; Qin et al.,
2010) that consists of relevance information between queries and documents with more than 30K
queries. Zoghi et al. (2015) created a finite set of rankers, each of which corresponds to a ranking
feature in the base dataset. The value $\mu_{i,j}$ is the probability that the ranker $i$ beats ranker $j$ based on
the navigational click model (Hofmann et al., 2013). We randomly extracted $K = 16, 64$ rankers
in our experiments and made sub preference matrices. The probability that the Condorcet winner
(a) Six rankers (b) Cyclic (c) MSLR K = 16

Figure 2: Regret-round semilog plots of RMED compared with theoretical bounds. We set f{K) = 
for all algorithms, and a = 3 for RMED2. 


exists in the subset of the rankers is high (more than 90%, c.f. Figure 1 in Zoghi et al. (2014a)), and
we excluded the relatively small case where the Condorcet winner does not exist.

A Condorcet winner exists in all datasets. In the experiments, the regrets of the algorithms were
averaged over 1,000 runs (Six rankers, Cyclic, Arithmetic, and Sushi), or 100 runs (MSLR).
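
The two synthetic datasets are easy to reproduce and check. The following sketch builds the Arithmetic matrix from its formula and verifies the structural claims (a Condorcet winner exists in both, but only Arithmetic is totally ordered). The helper names are hypothetical and arms are 0-indexed; the total-order test relies on the standard fact that a tournament without ties is transitive exactly when its win counts are 0, 1, ..., K-1.

```python
def arithmetic_matrix(K=8, step=0.05):
    """mu_{i,j} = 0.5 + step * (j - i); the formula depends only on the index difference."""
    return [[0.5 + step * (j - i) for j in range(K)] for i in range(K)]


def condorcet_winner(M):
    """Index of the arm that beats every other arm with probability > 1/2, or None."""
    K = len(M)
    for i in range(K):
        if all(M[i][j] > 0.5 for j in range(K) if j != i):
            return i
    return None


def is_totally_ordered(M):
    """True iff the relation 'i beats j when mu_{i,j} > 1/2' is a strict total order."""
    K = len(M)
    if any(M[i][j] == 0.5 for i in range(K) for j in range(K) if i != j):
        return False  # a tie between distinct arms rules out a strict order
    scores = sorted(sum(1 for j in range(K) if j != i and M[i][j] > 0.5)
                    for i in range(K))
    return scores == list(range(K))


# Cyclic dataset: Example 3 (Table 1(c)).
CYCLIC = [
    [0.5, 0.6, 0.6, 0.6],
    [0.4, 0.5, 0.9, 0.1],
    [0.4, 0.1, 0.5, 0.9],
    [0.4, 0.9, 0.1, 0.5],
]
```

Here `condorcet_winner` returns index 0 for both matrices, while `is_totally_ordered` holds for `arithmetic_matrix()` and fails for `CYCLIC`, where arms 2, 3, and 4 beat each other in a cycle.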

4.1. Comparison among algorithms 

We compared the IF, BTM with $\gamma = 1.2$, RUCB with $\alpha = 0.51$, Condorcet SAVAGE with $\delta = 1/T$,
MultiSBM and Sparring with $\alpha = 3$, and RMED algorithms. There are two versions of RUCB:
the one that uses a randomizer in choosing $l(t)$ (Zoghi et al., 2014b), and the one that does not
(Zoghi et al., 2013). We implemented both and found that the two perform quite similarly: we show
the result of the former one in this paper. We set f{K) = for all RMED algorithms and 

set $\alpha = 3$ for RMED2 and RMED2FH. The effect of $f(K)$ is studied in Appendix A. Note that IF
and BTM assume a total order among arms, which is not the case with the Cyclic, Sushi, and MSLR
datasets. MultiSBM and Sparring assume the existence of the utility of each arm, which does not
allow a cyclic preference that appears in the Cyclic dataset.

Figure 1 plots the regrets of the algorithms. In all datasets RMED significantly outperforms
RUCB, the next best excluding the different versions of RMED. Notice that the plots are on a base
10 log-log scale. In particular, the regret of RMED1 is less than half that of RUCB on all
datasets other than Cyclic, in which RMED2 performs much better. Among the RMED algorithms,
RMED1 outperforms RMED2 and RMED2FH on all datasets except for Cyclic, in which comparing
arm $i \ne 1$ with arm 1 is inefficient. RMED2 outperforms RMED2FH in five of the six datasets: this
could be due to the fact that RMED2FH does not update $\hat{b}^*(i)$ for ease of analysis.

4.2. RMED and asymptotic bound 

Figure 2 compares the regret of RMED with two asymptotic bounds. LB1 denotes the regret bound
of RMED1. TrueLB is the asymptotic regret lower bound given by Theorem 2.

RMED1 and RMED2: When $T \to \infty$, the slope of RMED1 should converge to LB1, and the ones
of RMED2 and RMED2FH should converge to TrueLB. On Six rankers, LB1 is exactly the same
as TrueLB, and the slope of RMED1 converges to this TrueLB. In Cyclic, the slope of RMED2
converges to TrueLB, whereas that of RMED1 converges to LB1, from which we see that RMED2
is actually able to estimate $b^*(i) \ne 1$ correctly. In MSLR K = 16, LB1 and TrueLB are very close
(the difference is less than 1.2%), and RMED1 and RMED2 converge to these lower bounds.

RMED2FH with different values of $\alpha$: We also tested RMED2FH with several values of $\alpha$. On
the one hand, with $\alpha = 1$, the initial phase of RMED2FH is too short to identify $b^*(i)$; as a result
it performs poorly on the Cyclic dataset. On the other hand, with $\alpha = 10$, the initial phase was
too long, which incurs a practically non-negligible regret on the MSLR K = 16 dataset. We also
tested several values of parameter $\alpha$ in RMED2FH. We omit plots of RMED2 with $\alpha = 1, 10$ for
the sake of readability, but we note that in our datasets the performance of RMED2 is always better
than or comparable with the one of RMED2FH under the same choice of $\alpha$, although the optimality
of RMED2 is not proved unlike RMED2FH.


5. Regret Analysis 

This section provides two lemmas essential for the regret analysis of RMED algorithms and proves
the asymptotic optimality of RMED1 based on these lemmas. A proof sketch on the optimal regret
of RMED2FH is also given.

The crucial property of RMED is that, by constantly comparing arms with the opponents, the
true Condorcet winner (arm 1) actually beats all the other arms with high probability. Let

$$\mathcal{U}(t) = \bigcap_{i \in [K] \setminus \{1\}} \{\hat{\mu}_{1,i}(t) > 1/2\}.$$

Under $\mathcal{U}(t)$, $\hat{\mu}_{i,1}(t) = 1 - \hat{\mu}_{1,i}(t) < 1/2$ for all $i \in [K] \setminus \{1\}$, and thus, $I_i(t) > 0$. Therefore, $\mathcal{U}(t)$
implies that $i^*(t) = \arg\min_{i \in [K]} I_i(t)$ is unique with $i^*(t) = 1$ and $I^*(t) = I_1(t) = 0$. Lemma
5 below shows that the average number of rounds that $\mathcal{U}^c(t)$ occurs is constant in $T$, where the
superscript $c$ denotes the complement.


Lemma 5 When RMED1 or RMED2FH is run, the following inequality holds:

$$E\left[ \sum_{t = T_{\mathrm{init}} + 1}^{T} 1\{\mathcal{U}^c(t)\} \right] = O\left(e^{AK - f(K)}\right), \quad (9)$$

where $A = A(\{\mu_{i,j}\}) > 0$ is a constant as a function of $T$.

Note that, since RMED2FH draws each pair $\lceil \alpha \log\log T \rceil$ times in the initial phase, we define
$T_{\mathrm{init}} = \lceil \alpha \log\log T \rceil (K-1)K/2$ for RMED2FH. We give a proof of this lemma in Appendix C.
Intuitively, this lemma can be proved from the facts that arm 1 is drawn within roughly $e^{I_1(t) - f(K)}$
rounds and $I_1(t)$ is not very large with high probability.
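
To get a feel for the length of the RMED2FH initial phase, the two quantities defined above can be computed directly. This is a small sketch with a hypothetical function name; natural logarithms are assumed, and $T$ must exceed 1 for $\log\log T$ to be defined.

```python
import math


def initial_phase(K, alpha, T):
    """Return (draws per pair, T_init) for RMED2FH: ceil(alpha log log T) and its total over all pairs."""
    draws_per_pair = math.ceil(alpha * math.log(math.log(T)))
    # (K - 1) * K is always even, so the integer division is exact.
    t_init = draws_per_pair * (K - 1) * K // 2
    return draws_per_pair, t_init
```

For instance, `initial_phase(16, 3, 10**5)` gives 8 draws per pair and an initial phase of 960 rounds, since $3 \log\log 10^5 \approx 7.33$ and there are 120 pairs.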

Next, for $i \in [K] \setminus \{1\}$ and $j \in \mathcal{O}_i$, let

$$N^{\mathrm{Suf}}_{i,j}(\delta) = \frac{(1 + \delta) \log T + f(K)}{d(\mu_{i,j}, 1/2)},$$

which is a sufficient number of comparisons of $i$ with $j$ to be convinced that the arm $i$ is not the
Condorcet winner. The following lemma states that if pair $(i,j)$ is drawn $N^{\mathrm{Suf}}_{i,j}(\delta)$ times then $i$ is
rarely selected as $l(t)$ again.

Lemma 6 When RMED1 or RMED2FH is run, for $i \in [K] \setminus \{1\}$, $j \in \mathcal{O}_i$,

$$E\left[ \sum_{t = T_{\mathrm{init}} + 1}^{T} 1\{l(t) = i, N_{i,j}(t) \ge N^{\mathrm{Suf}}_{i,j}(\delta)\} \right] = O\left(\frac{1}{\delta^2}\right).$$

We prove this lemma in Appendix D based on the Chernoff bound.

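Now we can derive the regret bound of RMED1 based on these lemmas.

Proof of Theorem 3: Since $\mathcal{U}(t)$ implies $m(t) = 1$ in RMED1, the regret increase per round can
be decomposed as:

$$r(t) \le 1\{\mathcal{U}^c(t)\} + \sum_{i \in [K] \setminus \{1\}} \frac{\Delta_{1,i}}{2} 1\{l(t) = i, m(t) = 1\}.$$

Using Lemmas 5 and 6, we obtain

$$E[R(T)] \le T_{\mathrm{init}} + \sum_{t = T_{\mathrm{init}} + 1}^{T} E[r(t)]$$

$$\le \frac{K(K-1)}{2} + E\left[ \sum_{t = T_{\mathrm{init}} + 1}^{T} 1\{\mathcal{U}^c(t)\} \right] + \sum_{i \in [K] \setminus \{1\}} \frac{\Delta_{1,i}}{2} E\left[ \sum_{t = T_{\mathrm{init}} + 1}^{T} 1\{l(t) = i, m(t) = 1\} \right] \quad (10)$$

$$\le \frac{K(K-1)}{2} + E\left[ \sum_{t = T_{\mathrm{init}} + 1}^{T} 1\{\mathcal{U}^c(t)\} \right] + \sum_{i \in [K] \setminus \{1\}} \frac{\Delta_{1,i}}{2} \left( N^{\mathrm{Suf}}_{i,1}(\delta) + E\left[ \sum_{t = 1}^{T} 1\{l(t) = i, N_{i,1}(t) \ge N^{\mathrm{Suf}}_{i,1}(\delta)\} \right] \right)$$

$$\le \frac{K(K-1)}{2} + \sum_{i \in [K] \setminus \{1\}} \frac{((1 + \delta) \log T + f(K)) \, \Delta_{1,i}}{2 d(\mu_{i,1}, 1/2)} + O\left(\frac{K}{\delta^2}\right) + O\left(e^{AK - f(K)}\right),$$

which immediately completes the proof of Theorem 3.

We also prove Theorem 4 on the optimality of RMED2FH based on Lemmas 5 and 6. Because
the full proof in Appendix E is somewhat lengthy, here we give its brief sketch.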

Proof sketch of Theorem 4 (RMED2FH): Similar to Theorem 3, we use the fact that $\mathcal{U}^c(t)$ does
not occur very often (i.e., Lemma 5). Under $\mathcal{U}(t)$, we decompose the regret into the contributions
of each arm $i \in [K] \setminus \{1\}$. There exists $C_2 > 0$ such that, for each $l(t) = i$, (i) with probability
$1 - O((\log T)^{-C_2})$, RMED2FH successfully estimates $\hat{b}^*(i) = b^*(i)$ and selects $m(t) = b^*(i)$ for most
rounds. The optimal $O(\log T)$ term comes from the comparison of $i$ and $b^*(i)$. Arm 1 is also drawn
for $O(\log T / \log\log T) = o(\log T)$ times. On the other hand, (ii) with probability $O((\log T)^{-C_2})$,
RMED2FH fails to estimate $b^*(i)$ correctly. By occasionally comparing arm $i$ with arm 1, we
can bound the regret increase by $O(\log T \log\log T)$. Since $O((\log T)^{-C_2} \times \log T \log\log T) =
o(\log T)$, this regret does not affect the $O(\log T)$ factor.


6. Discussion 

We proved the regret lower bound in the dueling bandit problem. The RMED algorithm is based 
on the likelihood that the arm is the Condorcet winner. RMED is proven to have the matching 
regret upper bound. The empirical evaluation revealed that RMED significantly outperforms the 
state-of-the-art algorithms. To conclude this paper, we mention three directions of future work. 

First, when a Condorcet winner does not necessarily exist, the Copeland bandits (Urvoy et al.,
2013) are a natural extension of our problem. Thus, seeking an effective algorithm for solving this
problem will be interesting. As is well known in the field of voting theory, there are several other
criteria of winners that are incompatible with the Condorcet / Copeland bandits, such as the Borda
winner (Urvoy et al., 2013). Comparing several criteria or developing an algorithm that outputs
more than one of these winners should be interesting directions of future work.

Second, another direction is sequential preference elicitation problems under relative feedback 
that go beyond the binary preference over pairs, such as multiscale feedback and/or preferences 
among three or more items. 

Third, in the standard bandit problem, it is reported that KL-UCB+ (Lai, 1987; Garivier and Cappé, 
2011) performs better than DMED. A study of a UCB-based optimal algorithm for the dueling bandits 
could yield an algorithm that outperforms RMED. 
[Figure 3 plot omitted; horizontal axis: K, the number of arms.] 


Figure 3: Performance of the RMED1 algorithm with several values of c. The plot shows the regret at 
T = 10^ in the MSLR dataset with K = 16, 32, 64, and 128. 


Appendix A. Experiment: Dependence on f(K) 

The event U^{t) implies a failure in identifying the Condorcet winner (i.e., 1 / Although 

is a constant function of T for any non-negative f{K), this term 
is not negligible with large K. To evaluate the effect of f{K), we set f{K) = and studied 

several values of c with the MSLR dataset (Figure 3). In the case of c = 0, the regret for K = 128 
becomes 100 times that for K = 16, which implies that the exponential dependence on K may 
not be an artifact of the proof. On the other hand, the results for c = 0.1, 0.3, and 1 indicate that 
this term can be much improved by simply letting c be a small positive value. 

Appendix B. Proofs on Regret Lower Bound 
B.1. Proof of Lemma 1 

Let $i \in [K] \setminus \{1\}$ be arbitrary and $M = \{\mu_{i,j}\}$ be an arbitrary preference matrix. We consider a 
modified preference matrix $M'$ in which the probabilities related to arm $i$ are different from $M$. Let 
$O'_i = \{j \mid j \in [K], \mu_{i,j} \le 1/2\}$, that is, $O'_i = O_i \cup \{j \mid j \in [K], \mu_{i,j} = 1/2\}$. For $j \in O'_i$, the $(i,j)$ element 
of $M'$ is $\mu'_{i,j}$ such that 

$$d(\mu_{i,j}, \mu'_{i,j}) = d(\mu_{i,j}, 1/2) + \epsilon. \tag{11}$$

Such a $\mu'_{i,j} > 1/2$ uniquely exists for sufficiently small $\epsilon > 0$ by the monotonicity and continuity 
of the KL divergence. For $j \notin O'_i$, let $\mu'_{i,j} = \mu_{i,j}$. Note that, unlike the original bandit problem, in 
the modified bandit problem the Condorcet winner is not arm 1 but arm $i$. Moreover, if $M \in \mathcal{M}_o$ 
then $M' \in \mathcal{M}_o$. 
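
The modified probability in (11) has no closed form, but it is easy to compute numerically. The following is a minimal sketch under stated assumptions: `bernoulli_kl` and `modified_mu` are hypothetical helper names, the values 0.4 and 0.01 are illustrative, and bisection is one convenient root finder rather than anything prescribed by the proof.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def modified_mu(mu, eps, tol=1e-12):
    """Solve d(mu, x) = d(mu, 1/2) + eps for x in (1/2, 1) by bisection.

    Assumes 0 < mu <= 1/2, so that d(mu, x) is increasing in x on [1/2, 1).
    """
    target = bernoulli_kl(mu, 0.5) + eps
    lo, hi = 0.5, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(mu, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical values: arm i loses to arm j with mu_ij = 0.4.
mu_ij, eps = 0.4, 0.01
mu_prime = modified_mu(mu_ij, eps)
print(mu_prime, bernoulli_kl(mu_ij, mu_prime))
```

The returned value lies above 1/2, so in the modified game arm i beats arm j, as the construction requires.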
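Notation: now, let $X^m_{i,j} \in \{0, 1\}$ be the result of the $m$-th draw of the pair $(i, j)$, 

$$\mathrm{KL}_j(n) = \sum_{m=1}^{n} \log \frac{X^m_{i,j}\,\mu_{i,j} + (1 - X^m_{i,j})(1 - \mu_{i,j})}{X^m_{i,j}\,\mu'_{i,j} + (1 - X^m_{i,j})(1 - \mu'_{i,j})},$$

and $\mathrm{KL} = \sum_{j \in O'_i} \mathrm{KL}_j(N_{i,j}(T))$. Let $\mathbb{P}'$ and $\mathbb{E}'$ be the probability and the expectation with respect to 
the modified bandit game. Then, for any event $\mathcal{E}$, 

$$\mathbb{P}'(\mathcal{E}) = \mathbb{E}\left[\mathbf{1}\{\mathcal{E}\} \exp(-\mathrm{KL})\right] \tag{12}$$
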


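holds. Let us define the events 

$$\mathcal{D}_1 = \Big\{ \sum_{j \in O'_i} d(\mu_{i,j}, \mu'_{i,j})\, N_{i,j}(T) < (1 - \epsilon) \log T,\; N_{i,i}(T) < \sqrt{T} \Big\},$$

$$\mathcal{D}_2 = \Big\{ \mathrm{KL} \le \Big(1 - \frac{\epsilon}{2}\Big) \log T \Big\},$$

$$\mathcal{D}_{12} = \mathcal{D}_1 \cap \mathcal{D}_2, \qquad \mathcal{D}_{1 \backslash 2} = \mathcal{D}_1 \cap \mathcal{D}_2^c.$$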
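
The change-of-measure identity (12) can be checked exactly in a simplified setting. The sketch below is an illustration under assumptions that are not in the text: a single pair is drawn a fixed number of times n (no adaptive sampling), and the values 0.4, 0.6, and the event "at least 5 wins out of 8" are hypothetical. It enumerates all outcome sequences and compares both sides of (12).

```python
import itertools
import math

def seq_prob(xs, mu):
    """Probability of a 0/1 outcome sequence under i.i.d. Bernoulli(mu)."""
    return math.prod(mu if x == 1 else 1.0 - mu for x in xs)

def log_ratio(xs, mu, mu_prime):
    """KL_j(n): summed log-likelihood ratio of the original vs. modified game."""
    return sum(
        math.log((x * mu + (1 - x) * (1 - mu))
                 / (x * mu_prime + (1 - x) * (1 - mu_prime)))
        for x in xs
    )

mu, mu_prime, n = 0.4, 0.6, 8        # hypothetical values

def event(xs):
    """An arbitrary event on the outcomes: at least 5 wins."""
    return sum(xs) >= 5

sequences = list(itertools.product((0, 1), repeat=n))
# Left-hand side: probability of the event in the modified game.
lhs = sum(seq_prob(xs, mu_prime) for xs in sequences if event(xs))
# Right-hand side: expectation in the original game, reweighted by exp(-KL).
rhs = sum(seq_prob(xs, mu) * math.exp(-log_ratio(xs, mu, mu_prime))
          for xs in sequences if event(xs))
print(lhs, rhs)
```

Both sums agree up to floating-point error, because exp(-KL) is exactly the likelihood ratio of the modified game to the original one on each sequence.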


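First step ($\mathbb{P}\{\mathcal{D}_{12}\} = o(1)$): From (12), 

$$\mathbb{P}'(\mathcal{D}_{12}) \ge \mathbb{E}\Big[\mathbf{1}\{\mathcal{D}_{12}\} \exp\Big(-\Big(1 - \frac{\epsilon}{2}\Big) \log T\Big)\Big] = T^{-(1 - \epsilon/2)}\, \mathbb{P}\{\mathcal{D}_{12}\}. \tag{13}$$

By using this we have 

$$\begin{aligned}
\mathbb{P}\{\mathcal{D}_{12}\} &\le T^{(1 - \epsilon/2)}\, \mathbb{P}'(\mathcal{D}_{12}) \\
&\le T^{(1 - \epsilon/2)}\, \mathbb{P}'\{N_{i,i}(T) < \sqrt{T}\} \\
&= T^{(1 - \epsilon/2)}\, \mathbb{P}'\{T - N_{i,i}(T) > T - \sqrt{T}\} \\
&\le T^{(1 - \epsilon/2)}\, \frac{\mathbb{E}'[T - N_{i,i}(T)]}{T - \sqrt{T}} \quad \text{(by the Markov inequality).}
\end{aligned} \tag{14}$$

Since this algorithm is strongly consistent, $\mathbb{E}'[T - N_{i,i}(T)] = o(T^a)$ for any $a > 0$. Therefore, the 
RHS of the last line of (14) is $o(T^{a - \epsilon/2})$, which, by choosing sufficiently small $a$, converges to zero 
as $T \to \infty$. In summary, $\mathbb{P}\{\mathcal{D}_{12}\} = o(1)$. 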

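Second step ($\mathbb{P}\{\mathcal{D}_{1 \backslash 2}\} = o(1)$): We have 

$$\begin{aligned}
\mathbb{P}\{\mathcal{D}_{1 \backslash 2}\} &\le \mathbb{P}\Big\{ \sum_{j \in O'_i} d(\mu_{i,j}, \mu'_{i,j})\, N_{i,j}(T) < (1 - \epsilon) \log T,\; N_{i,i}(T) < \sqrt{T},\; \sum_{j \in O'_i} \mathrm{KL}_j(N_{i,j}(T)) > \Big(1 - \frac{\epsilon}{2}\Big) \log T \Big\} \\
&\le \mathbb{P}\Big\{ \max_{\{n_j\} \in \mathbb{N}^{|O'_i|},\ \sum_{j \in O'_i} d(\mu_{i,j}, \mu'_{i,j})\, n_j < (1 - \epsilon) \log T}\ \sum_{j \in O'_i} \mathrm{KL}_j(n_j) > \Big(1 - \frac{\epsilon}{2}\Big) \log T \Big\}.
\end{aligned}$$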
Note that 

$$\max_{1 \le n \le N} \mathrm{KL}_j(n) = \max_{1 \le n \le N} \sum_{m=1}^{n} \log \frac{X^m_{i,j}\,\mu_{i,j} + (1 - X^m_{i,j})(1 - \mu_{i,j})}{X^m_{i,j}\,\mu'_{i,j} + (1 - X^m_{i,j})(1 - \mu'_{i,j})}$$

is the maximum of the sum of positive-mean random variables, and thus converges to its average 
(cf. Lemma 10.5 in Bubeck, 2010). Namely, 

$$\lim_{N \to \infty} \max_{1 \le n \le N} \frac{\mathrm{KL}_j(n)}{N} = d(\mu_{i,j}, \mu'_{i,j}) \quad \text{a.s.} \tag{15}$$

Let $\delta > 0$ be sufficiently small. We have, 


max 




logT 




^.)<(l-e) logT 


logT 

Combining this with the fact that (15) holds for any j, we have 


-h — 


5K 


minjeo/ 


^lo'i 


lim sup ■ 
N—^OO 


logT 


E,eO' KLj(nj) 


< 1 — e a.s.. 


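and thus 

$$\limsup_{T \to \infty}\ \max_{\{n_j\} \in \mathbb{N}^{|O'_i|},\ \sum_{j \in O'_i} d(\mu_{i,j}, \mu'_{i,j})\, n_j < (1 - \epsilon) \log T}\ \frac{\sum_{j \in O'_i} \mathrm{KL}_j(n_j)}{\log T} \le 1 - \epsilon + O(\delta) \quad \text{a.s.} \tag{16}$$
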


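By using the fact that (16) holds almost surely for any sufficiently small $\delta > 0$ and $1 - \epsilon/2 > 1 - \epsilon$, 
we have 

$$\mathbb{P}\Big\{ \max_{\{n_j\} \in \mathbb{N}^{|O'_i|},\ \sum_{j \in O'_i} d(\mu_{i,j}, \mu'_{i,j})\, n_j < (1 - \epsilon) \log T}\ \sum_{j \in O'_i} \mathrm{KL}_j(n_j) > \Big(1 - \frac{\epsilon}{2}\Big) \log T \Big\} = o(1).$$

In summary, we obtain $\mathbb{P}\{\mathcal{D}_{1 \backslash 2}\} = o(1)$. 

Last step: We here have 
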


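$$\begin{aligned}
\mathcal{D}_1 &= \Big\{ \sum_{j \in O'_i} d(\mu_{i,j}, \mu'_{i,j})\, N_{i,j}(T) < (1 - \epsilon) \log T \Big\} \cap \big\{ N_{i,i}(T) < \sqrt{T} \big\} \\
&= \Big\{ \sum_{j \in O'_i} N_{i,j}(T) \big(d(\mu_{i,j}, 1/2) + \epsilon\big) < (1 - \epsilon) \log T \Big\} \cap \big\{ N_{i,i}(T) < \sqrt{T} \big\} \quad \text{(by (11))} \\
&\supseteq \Big\{ \sum_{j \in O'_i} N_{i,j}(T) \big(d(\mu_{i,j}, 1/2) + \epsilon\big) + \frac{(1 - \epsilon) \log T}{\sqrt{T}}\, N_{i,i}(T) < (1 - \epsilon) \log T \Big\},
\end{aligned} \tag{17}$$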
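where we used the fact that $\{A < C\} \cap \{B < C\} \supseteq \{A + B < C\}$ for $A, B > 0$ in the last line. 
Note that, by using the result of the previous steps, $\mathbb{P}\{\mathcal{D}_1\} = \mathbb{P}\{\mathcal{D}_{12}\} + \mathbb{P}\{\mathcal{D}_{1 \backslash 2}\} = o(1)$. By using 
the complement of this fact, 

$$\mathbb{P}\Big\{ \sum_{j \in O'_i} N_{i,j}(T) \big(d(\mu_{i,j}, 1/2) + \epsilon\big) + \frac{(1 - \epsilon) \log T}{\sqrt{T}}\, N_{i,i}(T) \ge (1 - \epsilon) \log T \Big\} \ge \mathbb{P}\{\mathcal{D}_1^c\} = 1 - o(1).$$

Using the Markov inequality yields 

$$\mathbb{E}\Big[ \sum_{j \in O'_i} N_{i,j}(T) \big(d(\mu_{i,j}, 1/2) + \epsilon\big) + \frac{(1 - \epsilon) \log T}{\sqrt{T}}\, N_{i,i}(T) \Big] \ge (1 - \epsilon)(1 - o(1)) \log T. \tag{18}$$

Because $\mathbb{E}[N_{i,i}(T)]$ is subpolynomial as a function of $T$ due to the consistency, the second term in 
the LHS of (18) is $o(1)$ and thus negligible. Lemma 1 follows from the fact that (18) holds for any 
sufficiently small $\epsilon$. ■ 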
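
The almost-sure limit (15) used in the second step can be observed in simulation. This is a minimal sketch, not part of the proof: the probabilities 0.4 and 0.6, the number of draws, and the seed are all illustrative assumptions, and `running_max_ratio` is a hypothetical helper name.

```python
import math
import random

def bernoulli_kl(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def running_max_ratio(mu, mu_prime, n_draws, seed=0):
    """Simulate max_{n <= N} KL_j(n) / N with outcomes drawn from Bernoulli(mu)."""
    rng = random.Random(seed)
    win = math.log(mu / mu_prime)                     # increment when X = 1
    loss = math.log((1.0 - mu) / (1.0 - mu_prime))    # increment when X = 0
    total, best = 0.0, -math.inf
    for _ in range(n_draws):
        total += win if rng.random() < mu else loss
        best = max(best, total)
    return best / n_draws

mu, mu_prime = 0.4, 0.6      # hypothetical original and modified probabilities
estimate = running_max_ratio(mu, mu_prime, 200_000)
limit = bernoulli_kl(mu, mu_prime)
print(estimate, limit)
```

With these values the simulated ratio lands close to d(0.4, 0.6), in line with (15).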


B.2. Proof of Theorem 2 

We have 

1 




i “h i / \ \—^ i i 


> E + y: 


i(i[K] 


N^AT) 


> 


< 1/2 

E E 

i€[K]\{i} jeOi 




E E 


^l,i + 


2d(^*j,l/2) 


iGlK] 

d{/Lij, l/2)Nij{T). 


i&[K]\{i} jeOi 

Taking the expectation of both sides and using Lemma 1 yields 

$$\mathbb{E}[R(T)] \ge \sum_{i \in [K] \setminus \{1\}} \min_{j \in O_i} \frac{\Delta_{1,i} + \Delta_{1,j}}{2\, d(\mu_{i,j}, 1/2)}\, (1 - o(1)) \log T.$$
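
The leading constant of this lower bound can be evaluated for a concrete preference matrix. The sketch below rests on assumptions that are not spelled out in this appendix: it takes $\Delta_{1,j} = \mu_{1,j} - 1/2$ and $O_i = \{j : \mu_{i,j} < 1/2\}$, uses 0-based indices so that arm 0 plays the role of arm 1, and the 3-arm matrix is a hypothetical example.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def lower_bound_constant(mu):
    """Coefficient of log T: sum over i != 0 of
    min_{j in O_i} (Delta_{0,i} + Delta_{0,j}) / (2 d(mu_ij, 1/2)).

    mu[i][j] is the probability that arm i beats arm j; arm 0 is assumed
    to be the Condorcet winner.
    """
    k = len(mu)
    delta = [mu[0][j] - 0.5 for j in range(k)]       # assumed definition of Delta
    total = 0.0
    for i in range(1, k):
        superiors = [j for j in range(k) if mu[i][j] < 0.5]   # the set O_i
        total += min(
            (delta[i] + delta[j]) / (2.0 * bernoulli_kl(mu[i][j], 0.5))
            for j in superiors
        )
    return total

# Hypothetical 3-arm preference matrix with a total order 0 > 1 > 2.
mu = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]
constant = lower_bound_constant(mu)
print(constant)
```

For arm 2 the minimum is attained by comparing against arm 0 rather than arm 1: that comparison has the larger divergence from 1/2, so fewer draws are needed, and it adds no extra regret from the second arm.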


Appendix C. Proof of Lemma 5 

This lemma essentially states that the expected number of rounds in which arm 1 is underestimated 
is $O(1)$. We show this by bounding the expected number of rounds before arm 1 is compared 
for each fixed set of $\{N_{1,s}(t)\}$ and summing over $\{N_{1,s}(t)\}$. This technique is inspired by Lemma 
16 in Honda and Takemura (2010). Note that 

$$\mathcal{U}^c(t) = \bigcup_{S \in 2^{[K] \setminus \{1\}} \setminus \{\emptyset\}}\ \bigcap_{s \in S} \{\hat{\mu}_{1,s}(t) \le 1/2\} \cap \bigcap_{s \notin S} \{\hat{\mu}_{1,s}(t) > 1/2\}. \tag{19}$$



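Now we bound the number of rounds in which the event 

$$\bigcap_{s \in S} \{\hat{\mu}_{1,s}(t) \le 1/2\} \cap \bigcap_{s \notin S} \{\hat{\mu}_{1,s}(t) > 1/2\}$$

occurs. Let $\mathbb{N}$ be the set of non-zero natural numbers, and let $n_s \in \mathbb{N}$ and $x_s \in [0, \log 2]$ be arbitrary 
for each $s \in S$. Let $\hat{\mu}^n_{i,j}$ be the empirical estimate of $\mu_{i,j}$ at the $n$-th draw of pair $(i, j)$. If $\{\hat{\mu}^{n_s}_{1,s} \le 
1/2,\ d^+(\hat{\mu}^{n_s}_{1,s}, 1/2) = x_s,\ N_{1,s}(t) = n_s\}$ holds for $s \in S$ and $\hat{\mu}_{1,s}(t) > 1/2$ holds for $s \notin S$ then 

$$I_1(t) = \sum_{s \in S} n_s\, d^+(\hat{\mu}_{1,s}(t), 1/2)$$

and therefore $\mathcal{J}_1(t)$ holds for any 

$$t \ge \exp\Big( \sum_{s \in S} n_s\, d^+(\hat{\mu}_{1,s}(t), 1/2) - f(K) \Big).$$

If $\mathcal{J}_1(t)$ occurs, then arm 1 is in $L_N$ of the next loop, and thus for some $s \in S$, $N_{1,s}$ is incremented 
within $2K$ rounds. Therefore we have 

$$\sum_{t = T_{\mathrm{init}} + 1}^{T} \mathbf{1}\Big[ \bigcap_{s \in S} \{\hat{\mu}_{1,s}(t) \le 1/2,\ N_{1,s}(t) = n_s\} \cap \bigcap_{s \notin S} \{\hat{\mu}_{1,s}(t) > 1/2\} \Big] \le \exp\Big( \sum_{s \in S} n_s\, d^+(\hat{\mu}^{n_s}_{1,s}, 1/2) - f(K) \Big) + 2K.$$
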




Letting $P_s(x_s) = \Pr[\hat{\mu}^{n_s}_{1,s} \le 1/2,\ d^+(\hat{\mu}^{n_s}_{1,s}, 1/2) \ge x_s]$, we have 


E 


T 

E 1 

Tlnit + l 


P|{/li,s(t) < 1/2, Ni^s{t) = Us} n P|{/li,s(t) > 1 / 2 } 

s^S 


.s£S 


' {a;s}G[0,Iog 2]l®l 


exp I ^ UsXs - f{K) ) + 2it: j d{-Ps{xs)) 
V VseS 

= e-^(^)(2iLn^^(0) + n / 

I sGS 

= ( 21^11 + n f + f 

[ seS seS V 


5 e[o,iog 2 ] 


s£S 

^nsX^di-PsiXs)) 


sG[0,log 2] 


nse"“^‘’Ps{xs)dxs 


(integration by parts) 


<e-/w|(l + 2iL)J]P,(0) + n f 

I s£S 


7i^e"''>3;sg-ns(3;s+Ci(Ati,s,l/2))^^^ I 

/a;se[0,log2] J 

(by the Chernoff bound and Fact 10, where $C_1(\mu, \mu_2) = (\mu - \mu_2)^2 / (2\mu(1 - \mu_2))$) 


< I (1 + 2K) JJ e-^sd{l/2,^,l,s) + / 

I sGS seS-^^ 






a;sG[0,log 2] 


(1 + 2 iL) JJ 

. s^S s^S 


( 20 ) 


By summing (20) over {ns}, 

T 


E 


t —Tinit + l LsES” 


n{AM(i) < 1/2} n f|{Ai,s(t) > 1/2} 


s(^S 


<e ^ ^ ( (1 + 2iL) JJ e + JJ(log 2)nse «sC'i{r‘i,sd/2) 

{n4eNisi V 


ses 


s£S 


< .-li-o I (1+ 2K) n +(iog2)i*i n 

l s£S s£S 


,Ci(/.i.,,l/2) 


(eC'l(w,s,l/2) - 1)2 j ’ 


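where we used the fact that $\sum_{n=1}^{\infty} e^{-nx} = 1/(e^x - 1)$ and $\sum_{n=1}^{\infty} n e^{-nx} = e^x/(e^x - 1)^2$. Using 
(19) and the union bound over all $S \in 2^{[K] \setminus \{1\}} \setminus \{\emptyset\}$, we obtain 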


E 


T 


E !{'''(*)} 

i=71nit + l 


< e-/W J (1 + 2iL) JJ (l + 


.G[X]\{1} 


.diiA/iu_i)+(10^2)"^' n (i+ 

^ .e[ir]\{i} V 


,Ci(mi.«,1/2) 


= 0(e^^-l(^)), 


(eCilw,0.1/2) _ ly 

( 21 ) 
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where A 


log |max^e[A']\{i} max [ 1 + ^a(i/ 2 X,s) 


-1 


,log2 1 + 


(e' 


Cl (mi 


, 1 / 2 ) 

am 


-1)^ 
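
The two series identities used when summing (20) over $\{n_s\}$, namely $\sum_{n \ge 1} e^{-nx} = 1/(e^x - 1)$ and $\sum_{n \ge 1} n e^{-nx} = e^x/(e^x - 1)^2$ for $x > 0$, can be checked numerically. This is a small sanity check only; the value x = 0.3 and the truncation at 2000 terms are arbitrary choices.

```python
import math

def partial_sums(x, terms=2000):
    """Truncated versions of sum_n exp(-n x) and sum_n n exp(-n x), n >= 1."""
    geometric = sum(math.exp(-n * x) for n in range(1, terms + 1))
    weighted = sum(n * math.exp(-n * x) for n in range(1, terms + 1))
    return geometric, weighted

x = 0.3                      # arbitrary positive value
geometric, weighted = partial_sums(x)
closed_geometric = 1.0 / (math.exp(x) - 1.0)
closed_weighted = math.exp(x) / (math.exp(x) - 1.0) ** 2
print(geometric, closed_geometric)
print(weighted, closed_weighted)
```

The truncated sums match the closed forms to within floating-point error, since the neglected tail is of order exp(-600).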


Appendix D. Proof of Lemma 6 

Except for the first loop, arm $i$ must be put into $L_N$ before $\{l(t) = i\}$. For $t \ge T_{\mathrm{init}} + K + 1$ (i.e., after 
the first loop), let $\tau(t) < t$ be the round in the previous loop in which arm $l(t)$ is put into $L_N$. In that 
round, $\mathcal{J}_{l(t)}(\tau(t))$ is satisfied. With this definition, for any two rounds $t_1, t_2 \ge T_{\mathrm{init}} + K + 1$ such 
that $l(t_1) = l(t_2) = i$, $t_1 \ne t_2 \Rightarrow \tau(t_1) \ne \tau(t_2)$ holds because $\tau(t_1)$ and $\tau(t_2)$ belong to different 
loops. By using $\tau(t)$, we obtain 

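$$\begin{aligned}
&\sum_{t = T_{\mathrm{init}} + 1}^{T} \mathbf{1}[l(t) = i,\ N_{i,j}(t) \ge N^{\mathrm{suf}}_{i,j}(\delta)] \\
&\le K + \sum_{t = T_{\mathrm{init}} + K + 1}^{T} \mathbf{1}[l(t) = i,\ \mathcal{U}^c(\tau(t))] + \sum_{t = T_{\mathrm{init}} + K + 1}^{T} \mathbf{1}[l(t) = i,\ \mathcal{U}(\tau(t)),\ N_{i,j}(t) \ge N^{\mathrm{suf}}_{i,j}(\delta)] \\
&\le K + \sum_{t = T_{\mathrm{init}} + 1}^{T} \mathbf{1}[\mathcal{U}^c(t)] + \sum_{t = T_{\mathrm{init}} + K + 1}^{T} \mathbf{1}[l(t) = i,\ \mathcal{U}(\tau(t)),\ N_{i,j}(t) \ge N^{\mathrm{suf}}_{i,j}(\delta)].
\end{aligned}$$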

Note that the expectation of the term $\sum_{t = T_{\mathrm{init}} + 1}^{T} \mathbf{1}[\mathcal{U}^c(t)]$ is bounded by Lemma 5. Between $\tau(t)$ and 
$t$, the only round in which pair $(i, j)$ can be compared is the round of $\{l(t) = j\}$, which occurs at most 
once, and thus $N_{i,j}(t) - N_{i,j}(\tau(t)) \le 1$. By using this fact, we obtain 

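$$\begin{aligned}
&\sum_{t = T_{\mathrm{init}} + K + 1}^{T} \mathbf{1}[l(t) = i,\ \mathcal{U}(\tau(t)),\ N_{i,j}(t) \ge N^{\mathrm{suf}}_{i,j}(\delta)] \\
&\le \sum_{t = T_{\mathrm{init}} + K + 1}^{T} \mathbf{1}[l(t) = i,\ \mathcal{J}_i(\tau(t)),\ \mathcal{U}(\tau(t)),\ N_{i,j}(\tau(t)) \ge N^{\mathrm{suf}}_{i,j}(\delta) - 1] \\
&\le \sum_{t = T_{\mathrm{init}} + 1}^{T} \mathbf{1}[\mathcal{J}_i(t),\ \mathcal{U}(t),\ N_{i,j}(t) \ge N^{\mathrm{suf}}_{i,j}(\delta) - 1].
\end{aligned} \tag{22}$$
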
We can bound this term via $I_i(t)$ as 


i=T'init + l 


s E 1 

T 

s E 1 

T 

s E 1 

n=riVSuf(5)-i] 

T 

s E 1 

n=\NffiS)-l] 


U <logt + f{K),Nij{t) 

i=^init + l 


= n 


{hy U{t) => Ii{t) = 0) 

U 1/2) < logi + 

i=Tlnit + l 

iNff{6) - l)d+ifilj, 1/2) < logT + f{K) 


(23) 


Therefore, by letting ^ G (l/2,|Ujj) be a real number such that (i(/U, 1/2) = vve 


obtain from the Chernoff bound and the monotonicity of d'^{-, 1/2) that 


E 




d+(/l^,,l/2)< 


n=riVSf(5)-ll 
T 

- E exp(-fi(/r,/iij) 


d{^i,j, 1/2) 

l + <5 


n 


< 


n=lNff{5)-l] 

1 


< 


1 


exp (d(/i, /ijj)) - 1 d{n, fiij) ■ 


From Pinsker's inequality it is easy to confirm that $d(\mu, \mu_{i,j}) = \Omega(\delta^2)$, which completes the 
proof. ■ 
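
The final chain of the proof bounds a tail of the geometric series $\sum_n \exp(-d\,n)$ by $1/(e^{d} - 1)$ and then by $1/d$, where $d$ stands for $d(\mu, \mu_{i,j})$. The following minimal sketch checks both inequalities numerically; the values of $d$, the starting indices, and the horizon are hypothetical, and `tail_sum` is an illustrative helper name.

```python
import math

def tail_sum(d, n0, horizon):
    """Sum of exp(-d * n) for n = n0, ..., horizon."""
    return sum(math.exp(-d * n) for n in range(n0, horizon + 1))

checks = []
for d in (0.01, 0.1, 1.0):            # hypothetical values of d(mu, mu_ij)
    for n0 in (2, 5, 50):             # hypothetical starting indices
        s = tail_sum(d, n0, 5000)
        geometric = 1.0 / (math.exp(d) - 1.0)   # full series from n = 1
        checks.append((d, n0, s, geometric, 1.0 / d))

for d, n0, s, geometric, inverse in checks:
    print(f"d={d}, n0={n0}: {s:.4f} <= {geometric:.4f} <= {inverse:.4f}")
```

The second inequality is just $e^{d} - 1 \ge d$, so the bound degrades like $1/d$ as the two means approach each other, which is where the $\Omega(\delta^2)$ statement enters.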


Appendix E. Optimal Regret Bound: Full Proof of Theorem 4 

Events: Define 

iJ&lK] 

for sufficienfly small buf fixed > 0. If is easy fo see from fhe condinuify of d'^{^ij, 1/2) 
in fiij fhaf 3^j implies b*{i) = b*{i) when we lef > 0 be sufficienfly small wifh respecf fo 
j^[K]- Lef also 

= {/ii^b*(j)(f) < 1/2}- 
First step (regret decomposition): Like RMED1, in RMED2FH $\mathcal{U}(t)$ holds with high probability 
(i.e., Lemma 5). In the following, we bound the regret under $\mathcal{U}(t)$. Let 


ri{t) 


= i,U{t)]r{t) 

'• -V-' '-V-' 

(A) (B) 


(24) 


In the following, we first bound the terms (A) and (B), and then summarize all terms to prove 
Theorem 4. 

Second step (bounding (A)): Note that, {l{t) = i,U{t),yi, Zi{t)} is a sufficient condition for 
b*{i) = b*{i) and b*{i) € Oi{t). Therefore, 


l{l{t) =i,U{t),yi,Zi{t)}r{t) 

^ \ ^ 1 AT \ A^Suf I 4” ‘^l,b*(i) I 

< 2^ HKt) - hNi^b*(i)[t} > Ai,6.(i)(4)| +-^+ 2 log log r • 

t=7init-|-l 


By applying Lemma 6 with j = b*{i), for sufficiently small (5 > 0 we have 


E 


T 

sy -isui\ 


< o 




In summary, term (A) is bounded as: 


E 


T 


Y = iM{t),yi,Zi{t)]r{t) 

Ai,i + 


< 


® ® (^)+(«) 


Third step (bounding (B)): Now we consider the case {l{t) = {y^ U Zf{t)}}. Under 

this event b*{i) = b*{i) does not always hold but we can see that m{t) G {b*{i), 1} still holds. 
Furthermore, under this event arm b*{i) is selected as m{t) at most (loglogT)Aj^i(T) + 1 times 
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due to Line 5 of Algorithm 3. By using these facts, we have, 

T 


E 


l{l{t) =i,U{t),{yfUZf{t)}}r{t) 

t=7init + l 


< E 


< E 


T T 

y; !{((*) U («')}) 


y; iO{() = ,:,jVi.,(t)>jv5'(«)} 

i='Lnit + l 


< o 


+ P U U Zf{t') I [Nlf{6) log log T + 1 + Nff 

^^=7init + l J 

l\ + + iL + P I 3^f U U Zf{t') i O [Nff{6) log log t) 

^ I P=Tinit + l I 


.52 

(by Lemma 6). 

The following lemma bounds P |3^? U Ut'=r- t+i 
Lemma 7 For RMED2FH, there exists C 2 = C' 2 ({/Uiy}, AT, a) > 0 such that 


A'=rinit + l"^f(^0}- 


P<j3^fu U Zf{t)\=0{{logT)-^^). 

i=Tmt + l 


In summary, term (B) is bounded as: 


E 


i=7init + l 

< O ( + K + 0 (^Nffi6)ilogT)-^^ loglogr) 


(26) 
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Last step (regret bound): 


E[R{T)] < Tinit + ^ W 




^€[K]\{1} 


< Tinit + Y1 + E ••((A) + (B)) 


(by Lemma 5 and inequality (24)) 


7init + l 


iG[i^]\{l} 


< 0{aK^ loglogT) + 


iG[K]\{l} I 


j 


f logT \ 




< 


+ O + 2K + 0 (^Nlf{6){logT)-^^ log log t) 

(by (25) and (26)) 

0(aK2loglogr)+0(Ke“'-/(A-))+ ^ (Ag. + A.fBiXd + i) logD 

+ O {K{\og log log T)+0 (KfiK)). 


+ 0 


( K\ogT \ 


\log logT ) 


+ 0 


(27) 


Combining (27) with the fact that O [K (log T)^ log log T) = o i^\°gy ) completes the proof. 


E.l. Proof of Lemma 7 

We bound P{3^f} and IP{U^t t+i separately. On the one hand, 


P{3;f} = p < 


u tAi: 


[a log logT] 




p{i4y“8>“<!n _ > ^»t} 

< 2 exp (—2(A®“^)^a log log T) (by the Chemoff bound and Pinsker’s inequality) 


= ^ 2 (logT)"^(^l“')"" = 2K^ (logr)“2(^^“')"“ = 0((logr)-^“), 
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where Ca = > 0. On the other hand, 

T 


U ^f(*) 

^ i=^init + l 

p I U < 1/2 I < p ( U < 1/2} 


^init + 1 


. n= [q; log log T] 


— < 1/2} 
n=\a log log T] 
oo 

< ^ exp (—(i(l/2, |Uj (by the Chernoff bound) 

n=\a log log T] 

oo 


n=0 




^(1/2; (i) ) 1 


= o((iogr)-^^), 


where Cfe = a(i(l/2, iii^h*{i)) > 0- The proof is completed by letting C 2 = min {Ca, Cb) and taking 
the union bound of P{3^f} and P{U^Tinit+i ® 


Appendix F. Facts 
Fact 8 (The Chernoff bound) 

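Let $X_1, \ldots, X_n$ be i.i.d. binary random variables. Let $\hat{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $\mu = \mathbb{E}[\hat{X}]$. Then, for 
any $\epsilon > 0$, 

$$\mathbb{P}(\hat{X} \ge \mu + \epsilon) \le \exp(-d(\mu + \epsilon, \mu)\, n)$$

and 

$$\mathbb{P}(\hat{X} \le \mu - \epsilon) \le \exp(-d(\mu - \epsilon, \mu)\, n).$$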
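
As an illustration of Fact 8, the exact upper tail of a binomial mean can be compared with the Chernoff bound. This is a minimal sketch with illustrative values (mean 0.5, deviation 0.1, and three sample sizes); `upper_tail` is a hypothetical helper name.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def upper_tail(n, mu, threshold):
    """Exact P(mean of n i.i.d. Bernoulli(mu) draws >= threshold)."""
    k_min = math.ceil(threshold * n - 1e-12)   # guard against float round-off
    return sum(
        math.comb(n, k) * mu ** k * (1.0 - mu) ** (n - k)
        for k in range(k_min, n + 1)
    )

mu, eps = 0.5, 0.1           # illustrative values
results = []
for n in (10, 50, 200):
    exact = upper_tail(n, mu, mu + eps)
    bound = math.exp(-bernoulli_kl(mu + eps, mu) * n)
    results.append((n, exact, bound))
    print(f"n={n}: exact tail {exact:.3e} <= Chernoff bound {bound:.3e}")
```

Both quantities decay exponentially in n at the same rate, and the bound sits above the exact tail for every sample size.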


Fact 9 (The Pinsker’s inequality) 

For $p, q \in (0, 1)$, the KL divergence between two Bernoulli distributions is bounded as: 

$$d(p, q) \ge 2(p - q)^2.$$

Fact 10 (A minimum difference between divergences (Lemma 13 in Honda and Takemura, 2010)) 
Let $\mu$ and $\mu_2$ satisfy $0 < \mu_2 < \mu < 1$, and let $C_1(\mu, \mu_2) = (\mu - \mu_2)^2 / (2\mu(1 - \mu_2))$. Then, for 
any $\mu_3 \le \mu_2$, 

$$d(\mu_3, \mu) - d(\mu_3, \mu_2) \ge C_1(\mu, \mu_2) > 0.$$
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